Framing the Problem

Problem Recognition

The business problem analyzed in this paper is consumer behavior towards the use of “Citi Bike” in New York City (NYC) from 2018 to 2020. Using descriptive and inferential statistics and specific R programming tools, the paper aims to explain the variance in NYC Citi Bike rides within this time frame by identifying potential explanatory variables and answering specific questions, such as “Does the user gender affect the decision to take a ride with NYC Citi Bike?”, in order to determine which of the chosen variables can be considered “statistically significant”. The analysis of NYC Citi Bike users’ behavior will also provide insights into missed opportunities to attract new customer segments or to improve retention within the existing ones.

 

Review of Previous findings

During the last decade, NYC has seen an unprecedented increase in the number of residents, jobs, and tourists, which has inevitably increased the demand for transportation. Moreover, the rise of several green movements and the consequent shift of New Yorkers towards healthier and more environmentally friendly behavior have changed how different types of transport are chosen and used.

“Now more than ever we see that there’s widespread demand to bike around NYC”, said Katherine Willis, an activist with Transportation Alternatives. New Yorkers are shifting towards cycling as a preferred way of getting around (Raskind and Meyer, 2020), as cycling in NYC is cheaper and faster than fuel-powered transportation and is also eco-friendly. This unprecedented shift has also been driven by the coronavirus pandemic, which triggered a biking boom and increased active travel by 42% globally. As also noted in the WSJ (2020a), this disruption offers a unique opportunity for NYC urban cycling to gain ground and market share (WSJ, 2020b).

In this scenario, “Citi Bike” is the leading NYC bike-sharing system, and the demand for its bikes is growing at a record pace.

 

 

Solving the problem

Data Collection and Variable Selection

The main data source is the “NYC Citi Bike Trip History”, published on the Citi Bike website, which gathers hourly historical data about NYC Citi Bike rides between 2018 and 2020 from each of its bike stations, accounting for 1,212,194 observations and 8 variables. The dataset’s variables are: year, month, day, daily time slot, user type, user gender, user birth year, and rides count.

A secondary data source was used in support of the Citi Bike usage investigation: the “NYC Weather History”, published by the Weather Underground website, which gathers daily historical data about NYC weather between 2018 and 2020, accounting for 1,071 observations and 9 variables. The dataset’s variables are: year, month, day, average temperature (F), average dew point (F), average humidity (%), average wind speed (mph), average pressure, and total precipitation (in).

 

Data Wrangling: Cleaning, Structuring and Transforming

The NYC Citi Bikes Trip History Dataset (2018-2020)

Data Cleaning with Tableau Prep

Since R could not load the original “NYC Citi Bikes Trip History Dataset” into memory, returning the error “Error: cannot allocate vector of size 6.2 Gb”, the dataset was tidied and partially cleaned using Tableau Prep. The following steps were used to reduce the original dataset of 18M observations and 11 variables to 1.2M observations and 8 variables:

  • STEP 1: The 12 NYC Citi Bike monthly datasets of each year were connected to a dedicated workflow.

  • STEP 2: In each workflow, the 12 NYC Citi Bike monthly datasets were combined into groups of two using the union function.

  • STEP 3: Each bimonthly union was cleaned by deleting all the unnecessary variables, decreasing the union width from 11 to 8.

  • STEP 4: In each bimonthly union, a calculated field was created using the IF() and DATEPART() functions to identify the daily time slot in which each observation was recorded. As a consequence, a new column called “Daily_Time_Slots” was created, with the following categories: Early Morning, Late Morning, Early Afternoon, Late Afternoon, Evening, and Late Evening.

  • STEP 5: The observations of each bimonthly union were first grouped by year, month, day, daily_time_slot, user_type, gender, and user_birth_year. Then, a new column called “Rides_Count” was created to count the distinct recorded observations within each group.

  • STEP 6: A union of the 6 bimonthly unions was created.

  • STEP 7: The flow was run to create and save the cleaned output.
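For readers without Tableau Prep, steps 4 and 5 above can be approximated in base R on a dataset small enough to fit in memory. The following is only a sketch under stated assumptions: the `trips` data frame and its column names are hypothetical, and the hour boundaries of the time slots are assumed (the report only states that “Late Evening” runs from 9 p.m. to 12 a.m.).

```r
# Hypothetical raw trips: one row per ride with a start timestamp
trips <- data.frame(
  starttime  = as.POSIXct(c("2018-01-02 18:05:00", "2018-01-02 18:40:00",
                            "2018-01-02 22:15:00", "2018-01-15 16:30:00"),
                          tz = "UTC"),
  user_type  = c("Subscriber", "Subscriber", "Customer", "Subscriber"),
  gender     = c(1, 1, 2, 1),
  birth_year = c(1962, 1962, 1989, 1975)
)

# Step 4: map the start hour to a daily time slot.
# The cut points are assumptions, except Late Evening (21:00 to 24:00).
hour <- as.integer(format(trips$starttime, "%H"))
trips$Daily_Time_Slots <- cut(
  hour,
  breaks = c(0, 9, 12, 15, 18, 21, 24),
  labels = c("Early Morning", "Late Morning", "Early Afternoon",
             "Late Afternoon", "Evening", "Late Evening"),
  right = FALSE
)

# Step 5: group by date, time slot and user attributes, then count rides
trips$Year  <- as.integer(format(trips$starttime, "%Y"))
trips$Month <- as.integer(format(trips$starttime, "%m"))
trips$Day   <- as.integer(format(trips$starttime, "%d"))
trips$Rides_Count <- 1L
rides <- aggregate(
  Rides_Count ~ Year + Month + Day + Daily_Time_Slots +
    user_type + gender + birth_year,
  data = trips, FUN = sum
)
rides
```

Aggregating before modelling is what shrinks the data from one row per ride to one row per group, which is why the cleaned dataset is so much smaller than the original.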

 

Below is an example of the Tableau Prep workflow in which the previous steps were applied. Note that the same process was followed three times, once for each year.

 

 

 

Data Cleaning with R

To examine each variable within the dataset, the first five rows were displayed using the head() function. After inspecting the original dataset, converting some variables’ data types proved essential and required specific data manipulations. For example, the “Daily_Time_Slots” data type was converted from double to factor using the as.factor() function. Moreover, the two levels of the categorical variable “User_Gender” were renamed, changing 1 to “Male” and 2 to “Female”. The same approach was applied to rename the “Customer” level of the categorical variable “User_Type” to “Occasional”. Lastly, three new variables, “Age”, “Age_Group” and “Season”, were created using the mutate() function and added to the original “nyc_citibikes” dataset, creating a new one called “nyc_citibikes_wrangled”. The following chunk holds the code for some of the most important data wrangling operations performed.

 

library(dplyr)

# Renaming "User_Gender" levels (1 -> "Male", 2 -> "Female")
levels(nyc_citibikes$User_Gender) <- c("Male", "Female")

# Renaming "User_Type" levels ("Customer" -> "Occasional")
levels(nyc_citibikes$User_Type) <- c("Occasional", "Subscriber")

# Creating "Age" (numerical) and "Season" (categorical) in the wrangled dataset
nyc_citibikes_wrangled <- nyc_citibikes %>%
  mutate(
    Age = 2020 - User_Birth_Year,
    Season = case_when(
      Month %in% c(12, 1, 2) ~ "Winter",
      Month %in% c(3, 4, 5) ~ "Spring",
      Month %in% c(6, 7, 8) ~ "Summer",
      Month %in% c(9, 10, 11) ~ "Fall"
    )
  )
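The code for the third new variable, “Age_Group”, is not shown in the report. One possible way to derive it is with cut(), as sketched below. The cut points and the “Seniors” label are assumptions chosen only to be consistent with the preview rows of the wrangled dataset; they are not the author’s definitions.

```r
# Ages matching the five preview rows of the wrangled dataset
age <- c(58, 45, 55, 21, 31)

# Hypothetical cut points; the "Seniors" label is also an assumption
age_group <- cut(
  age,
  breaks = c(0, 35, 50, 65, Inf),
  labels = c("Young Adults", "Adults", "Middle-Aged Adults", "Seniors"),
  right = FALSE
)

data.frame(Age = age, Age_Group = age_group)
```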

 

To better visualize the new wrangled dataset, the first 5 rows of “nyc_citibikes_wrangled” have been displayed.

 

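| Year | Month | Day | Daily_Time_Slots | User_Type | User_Gender | User_Birth_Year | Rides_Count | Season | Age | Age_Group |
|------|-------|-----|------------------|-----------|-------------|-----------------|-------------|--------|-----|-----------|
| 2018 | 1 | 2 | Evening | Subscriber | Male | 1962 | 47 | Winter | 58 | Middel-Aged Adults |
| 2018 | 1 | 15 | Late Afternoon | Subscriber | Male | 1975 | 68 | Winter | 45 | Adults |
| 2018 | 1 | 28 | Late Morning | Occasional | Male | 1965 | 1 | Winter | 55 | Middel-Aged Adults |
| 2018 | 1 | 24 | Early Afternoon | Subscriber | Female | 1999 | 1 | Winter | 21 | Young Adults |
| 2018 | 4 | 10 | Early Afternoon | Occasional | Female | 1989 | 3 | Spring | 31 | Young Adults |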

 

The NYC Weather History Dataset (2018-2020)

After inspecting the first rows of the original dataset, converting some data types again proved essential for the NYC Weather History Dataset and required specific data manipulations. Moreover, a new variable called “Season” was created using the mutate() function and added to the initial “nyc_weather” dataset, generating a new one called “nyc_weather_wrangled”. Lastly, the variables were renamed to make the dataset easier to read and analyze. The following chunk holds the code for one of the most important data wrangling operations performed.

 

# Example of renaming a variable
names(nyc_weather_wrangled)[names(nyc_weather_wrangled) == "Precipitation_(in)_Tot"] <- "Total_Precipitation"

 

The first five rows of the new dataset, “nyc_weather_wrangled”, are displayed to highlight the results of the data manipulations made throughout the data wrangling process.

 

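| Year | Month | Day | Avg_Temperature | Avg_Dew_Point | Avg_Humidity | Avg_Wind_Speed | Pressure_(Hg)_Avg | Total_Precipitation | Season |
|------|-------|-----|-----------------|---------------|--------------|----------------|-------------------|---------------------|--------|
| 2018 | 1 | 1 | 13.3 | -3.1 | 49.0 | 17.6 | 30.4 | 0.00 | Winter |
| 2018 | 1 | 2 | 19.2 | 4.6 | 53.4 | 15.6 | 30.4 | 0.00 | Winter |
| 2018 | 1 | 3 | 22.3 | 5.5 | 49.1 | 8.3 | 30.3 | 0.00 | Winter |
| 2018 | 1 | 4 | 24.0 | 19.3 | 83.1 | 28.0 | 29.5 | 0.04 | Winter |
| 2018 | 1 | 5 | 13.2 | -2.9 | 49.6 | 25.1 | 29.8 | 0.45 | Winter |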

 

 

Data Analysis

Citi Bike Users: Age Impact on NYC Citi Bike Rides

By analyzing the users’ age distribution in depth, some erroneous user birthdates were detected. Since these values contaminate the age distribution of the Citi Bike users and might lead to data distortion and false investigation results, it is necessary to drop all of them before moving on with the analysis. Calling each user to verify the birthdate and age would have been too expensive and time-consuming, and therefore not feasible. Instead, the age distribution quantiles and common sense were used to decide which age values to drop.

 

Dropping the Contaminated Data

To better understand the distribution of the “Age” variable, its quantiles were calculated. According to the quantile analysis, the contaminated ages range mainly between 102 and 163 years old. However, common sense suggests that a 102-year-old user would not ride a bicycle. Moreover, life expectancy in the United States was estimated at 78.87 years in 2019. Therefore, it was decided to drop all age values greater than 80. All the erroneous data, which affected 3.9% of the total observations, were filtered out from “nyc_citibikes_wrangled”, and a new object called “nyc_citibikes_not_contaminated” was created to store the filtered dataset (1,184,441 observations; 10 variables).
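The quantile inspection and the 80-year cutoff described above can be sketched as follows. The `citibikes` data frame is simulated for illustration only; the cutoff of 80 is the one stated in the text.

```r
set.seed(1)

# Hypothetical ages: mostly plausible values plus a few implausible ones
citibikes <- data.frame(
  Age = c(sample(18:75, 95, replace = TRUE), 102, 120, 131, 150, 163)
)

# Inspect the upper tail of the distribution
quantile(citibikes$Age, probs = c(0.5, 0.9, 0.95, 0.99, 1))

# Keep only ages up to the 80-year cutoff chosen in the text
not_contaminated <- subset(citibikes, Age <= 80)

# Share of rows dropped, and the mean age before and after filtering
dropped_share <- 1 - nrow(not_contaminated) / nrow(citibikes)
c(before = mean(citibikes$Age), after = mean(not_contaminated$Age))
```

Reporting the dropped share alongside the before and after means makes it easy to judge how much the filter changes the picture.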

 

After dropping all the contaminated values, the mean age of Citi Bike users is 44 years. This value is smaller and more representative than the mean of 46 years that would have been obtained without dropping the erroneous values.

 

 

Testing the First Investigation Hypothesis

Before testing the hypothesis, a bar chart of rides by age was plotted to check whether an underlying pattern exists between the two variables. The graph shows a negative relationship between age and rides count.

 

 

Now, let’s test the first investigation hypothesis, “The user age has an effect on the NYC Citi Bike rides”, as follows:

\[ \begin{aligned} H_0 &: \text{The user age has no effect on the NYC Citi Bike rides} \\ H_A &: \text{The user age has an effect on the NYC Citi Bike rides} \end{aligned} \]

 

If the following linear model fails to reject the null hypothesis, the first investigation hypothesis should be reconsidered. Moreover, since the “Rides_Count” distribution is highly skewed, a log transformation is required.

 

\[ \text{log(Rides)} = \beta_0 + \beta_1 \text{(Age)} + \epsilon \]

 

The fitted model has an intercept equal to 3.745 and a slope equal to -0.024. Because the response is log-transformed, a one-year increase in user age is associated with a decrease of 0.024 in log(Rides), that is, roughly 2.4% fewer rides on average. Moreover, the p-values are very close to zero, meaning that the explanatory variable “Age” is highly significant. Therefore, we can reject the null hypothesis and conclude that there is a negative relationship between age and rides count. The first investigation hypothesis is thus confirmed.
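Since the response is on the log scale, a slope is easier to read once converted into a percentage change in rides. The sketch below fits the same kind of model on simulated data (the data are made up; only the intercept and slope used to generate them come from the text) and applies the conversion (exp(slope) - 1) * 100.

```r
set.seed(42)
n <- 5000
age <- sample(18:80, n, replace = TRUE)

# Simulated log-rides using the reported intercept and slope, plus noise
log_rides <- 3.745 - 0.024 * age + rnorm(n, sd = 1)

fit <- lm(log_rides ~ age)
slope <- unname(coef(fit)["age"])

# Percentage change in rides for a one-year increase in age (simulated fit)
pct_change_per_year <- (exp(slope) - 1) * 100

# The same conversion applied to the reported slope of -0.024
pct_reported <- (exp(-0.024) - 1) * 100
c(simulated = pct_change_per_year, reported = pct_reported)
```

For small slopes the percentage change is close to 100 times the slope, which is why -0.024 reads as roughly a 2.4% decrease per year of age.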

 

 

Citi Bike Users: Gender impact on NYC Citi Bike Rides

The following bar chart has been plotted to investigate the existence of underlying trends in Citi Bike rides based on the user’s gender.

 

 

Males were the leading user segment of Citi Bike from 2018 to 2020. From 2018 to 2019, male usage of Citi Bike increased by 17.42% (from 11,950,869 to 14,032,973 total rides), while female usage increased by 20.59% (from 4,086,438 to 4,927,933 total rides). From 2019 to 2020, instead, Citi Bike usage decreased by 21.05% for males while it increased by 6.4% for females. Overall, the total rides count increased by 18.23% from 2018 to 2019 and decreased by 13.92% from 2019 to 2020, following the usage trend of male users.
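The 2018 to 2019 growth rates quoted above can be reproduced from the ride totals with a small helper. The function name `pct_change` is introduced here for illustration, and the total is assumed to be the sum of male and female rides.

```r
# Percentage change from an old value to a new value
pct_change <- function(old, new) (new - old) / old * 100

male_2018   <- 11950869; male_2019   <- 14032973
female_2018 <- 4086438;  female_2019 <- 4927933

male_growth   <- pct_change(male_2018, male_2019)
female_growth <- pct_change(female_2018, female_2019)

# Assumption: total rides are the sum of male and female rides
total_growth  <- pct_change(male_2018 + female_2018,
                            male_2019 + female_2019)

round(c(male = male_growth, female = female_growth, total = total_growth), 2)
```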

 

 

 

Testing the Second Investigation Hypothesis

Given the underlying pattern between user gender and Citi Bike rides just discovered, a second investigation hypothesis, “The user gender has an effect on the NYC Citi Bike rides”, should be tested as follows:

 

\[ \begin{aligned} H_0 &: \text{The user gender has no effect on the NYC Citi Bike rides} \\ H_A &: \text{The user gender has an effect on the NYC Citi Bike rides} \end{aligned} \]

 

If the following linear model fails to reject the null hypothesis, the second investigation hypothesis should be reconsidered.

 

\[ \text{log(Rides)} = \beta_0 + \beta_1 \text{(User Gender)} + \epsilon \]

 

The fitted model has an intercept equal to 2.97 and a coefficient equal to -0.63 for female users. The p-value for the dummy variable “User_GenderFemale” is very close to zero, suggesting that there is statistical evidence of a difference in NYC Citi Bike rides between genders. Therefore, we can reject the null hypothesis, and the second investigation hypothesis is confirmed. In detail, log(Rides) is on average 0.631 lower for female users than for male users.

 

 

Citi Bike Users: The Impact of User Type on Seasonal NYC Citi Bike Rides

The following charts have been plotted to investigate the existence of underlying trends in Citi Bike rides based on user type and season.

 

 

Subscribers were the leading NYC Citi Bike user type from 2018 to 2020, reaching a total of 46,355,536 rides and, consequently, representing 90.32% of total usage. These values are significantly higher than those registered for occasional users, who reached a total of only 4,964,985 rides over the three years. For both user types, the best time for cycling is Summer, which reached a total of 16,715,023 rides between 2018 and 2020, followed by Fall, Spring, and Winter. As common sense might suggest, Winter is the least popular season for riding a bicycle, accounting for only 14.56% of the total rides registered between 2018 and 2020. Moreover, while the number of subscribers remains similar across seasons, the number of occasional users varies widely, probably due to a greater influx of international and domestic tourists in specific seasons of the year. To conclude, the plotted graphs and the data summary show an underlying pattern between rides count, user type and season.

 

 

 

Testing the Third Investigation Hypothesis

Given the underlying pattern just discovered, a third investigation hypothesis, “The user type and the season have an effect on the NYC Citi Bike rides”, should be tested as follows:

 

\[ \begin{aligned} H_0 &: \text{The user type and the season have no effect on the NYC Citi Bike rides} \\ H_A &: \text{The user type and the season have an effect on the NYC Citi Bike rides} \end{aligned} \]

 

If the following linear model fails to reject the null hypothesis, the third investigation hypothesis should be reconsidered.

 

\[ \text{log(Rides)} = \beta_0 + \beta_1 \text{(User Type)} + \beta_2 \text{(Season)} + \beta_3 \text{(User Type*Season)} + \epsilon \]

 

The fitted model has an intercept equal to 1.78, a beta_1 equal to 1.675 for subscribers, and beta_2 and beta_3 values that change according to the level of the categorical explanatory variable “Season”. While being a subscriber increases log(Rides) by 1.675 on average, Winter and Spring have a negative impact. Except for the interaction term “User_TypeSubscriber:SeasonSpring”, which has a p-value of 0.096, all the other p-values are very close to zero, meaning that, considered individually, the chosen explanatory variables are highly significant. In other words, there is statistical evidence of a difference in NYC Citi Bike rides between user types and seasons. Moreover, according to the F statistic and its p-value, the null hypothesis has to be rejected. To conclude, the third investigation hypothesis is confirmed.
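To make the structure of this model concrete, the sketch below fits a regression with an interaction between two factors on simulated data. The data and the effect sizes are made up; only the formula shape, User_Type * Season, mirrors the model described above.

```r
set.seed(7)
n <- 2000
d <- data.frame(
  User_Type = factor(sample(c("Occasional", "Subscriber"), n, replace = TRUE)),
  Season    = factor(sample(c("Fall", "Spring", "Summer", "Winter"), n,
                            replace = TRUE))
)

# Simulated response: a subscriber effect plus a smaller winter effect
d$log_rides <- 1.8 + 1.7 * (d$User_Type == "Subscriber") -
  0.3 * (d$Season == "Winter") + rnorm(n)

# User_Type * Season expands to both main effects plus their interaction
fit_int <- lm(log_rides ~ User_Type * Season, data = d)
round(coef(summary(fit_int)), 3)
```

With “Occasional” and “Fall” as reference levels, the fit returns one intercept, one user type coefficient, three season coefficients and three interaction coefficients, which is why beta_2 and beta_3 take a different value for each season.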

 

 

Citi Bike Users: Impact of the Daily Time Slots on NYC Citi Bike rides

The following chart has been plotted to investigate the existence of underlying trends in Citi Bike rides based on daily time slots.

 

 

According to the graph above, the busiest time to ride between 2018 and 2020 was “Late Afternoon”, which reached a total of 12,108,918 rides and, consequently, represents 23.6% of total usage. Late Afternoon is followed by “Evening”, which represents 20.28% with a total of 10,407,186 rides. The least busy time to ride is instead “Late Evening”, which runs from 9 p.m. to 12 a.m. and represents only 6.8%. Looking at the data recorded each year, something interesting can be seen: while 2018 and 2019 have an identical ranking of daily time slots, 2020 differs in having “Early Afternoon” instead of “Late Morning” as the third busiest time slot. Such a discrepancy could have been caused by a change in customer behavior in response to the global pandemic, which involved shorter working days in most sectors. To conclude, judging from the three-year data and the graph, the difference in rides count across daily time slots is not very evident. Therefore, this explanatory variable might not be significant in explaining the variance in NYC Citi Bike rides.

 

 

 

Testing the Fourth Investigation Hypothesis

Since the underlying pattern between daily time slots and rides count just discovered does not seem to be that strong, a fourth investigation hypothesis, “The daily time slots have no effect on the NYC Citi Bike rides”, should be tested as follows:

 

\[ \begin{aligned} H_0 &: \text{The daily time slots have no effect on the NYC Citi Bike rides} \\ H_A &: \text{The daily time slots have an effect on the NYC Citi Bike rides} \end{aligned} \]

 

This time, if the following model rejects the null hypothesis, the fourth investigation hypothesis should be reconsidered.

 

\[ \text{log(Rides)} = \beta_0 + \beta_1 \text{(Daily Time Slots)} + \epsilon \]

 

The fitted model has an intercept equal to 2.78 and a beta_1 that changes according to the level of the categorical explanatory variable “Daily_Time_Slots”. While “Evening” and “Late Afternoon” have a positive coefficient, “Early Morning”, “Late Evening” and “Late Morning” have a negative one. Moreover, all the individual p-values are very close to zero, meaning that the chosen explanatory variable is highly significant. In other words, there is statistical evidence of a difference in NYC Citi Bike rides between daily time slots. To conclude, the null hypothesis has to be rejected and the fourth investigation hypothesis has to be reconsidered.

 

 

Citi Bike Users: Impact of the Temperature on NYC Citi Bike rides

The following chart has been plotted to investigate the existence of underlying trends in NYC Citi Bike rides based on temperature. In detail, the plotted graph will consider only the month of October (2019), in which usually the NYC temperature fluctuates the most. It has been assumed that this fluctuation combined with the count of total rides recorded each day will give a more realistic look into the examined relationship.
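Relating rides to temperature requires the ride data to be aggregated per day and matched with the daily weather records. A minimal sketch of that join is shown below. The two small data frames are hypothetical excerpts with made-up values, and the join keys (Year, Month, Day) are assumed from the variables the two datasets share.

```r
# Hypothetical excerpts of the two wrangled datasets (made-up values)
rides <- data.frame(
  Year = 2019, Month = 10, Day = c(5, 5, 6, 6, 7),
  Rides_Count = c(40, 25, 50, 30, 20)
)
weather <- data.frame(
  Year = 2019, Month = 10, Day = c(5, 6, 7),
  Avg_Temperature = c(55.1, 60.4, 68.9)
)

# Total rides per day, then attach that day's average temperature
daily_rides <- aggregate(Rides_Count ~ Year + Month + Day,
                         data = rides, FUN = sum)
daily <- merge(daily_rides, weather, by = c("Year", "Month", "Day"))
daily

# Correlation as a quick numeric companion to the chart
cor(daily$Rides_Count, daily$Avg_Temperature)
```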

 

 

By analyzing the chart above, it is possible to see that on specific days of October 2019 a downward trend in the average temperature is followed by a downward trend in the number of NYC Citi Bike rides. In detail, this positive relationship between average temperature and NYC Citi Bike rides can be clearly seen on the 5th, 9th, and 19th of the month. However, this positive relationship does not always hold: for example, on the 7th and the 31st the temperature is higher than on other days, but NYC Citi Bike usage is lower.

 

Testing the Fifth Investigation Hypothesis

Since the underlying pattern between average temperature and rides count just discovered does not seem to be that clear and strong, a fifth investigation hypothesis, “The average temperature has no effect on the NYC Citi Bike rides”, should be tested as follows:

 

\[ \begin{aligned} H_0 &: \text{The average temperature has no effect on the NYC Citi Bike rides} \\ H_A &: \text{The average temperature has an effect on the NYC Citi Bike rides} \end{aligned} \]

 

If the following model rejects the null hypothesis, the fifth investigation hypothesis should be reconsidered.

 

\[ \text{log(Rides)} = \beta_0 + \beta_1 \text{(Avg Temperature)} + \epsilon \]

 

 

The fitted model has an intercept equal to 2.041 and a slope equal to 0.011. If the average temperature rises by one degree, the estimated response variable, log(Rides), increases by 0.011. Moreover, the p-values are very close to zero, meaning that the explanatory variable “Avg_Temperature” is highly significant. Therefore, we can reject the null hypothesis and conclude that there is a relationship between average temperature and rides count. To conclude, the fifth investigation hypothesis has to be reconsidered.

 

 

Investigation Updates and Modifications

After testing the five investigation hypotheses, some modifications have to be made. First, contrary to what was initially guessed, the explanatory variable “Daily_Time_Slots” is statistically significant. Therefore, it has to be analyzed in more depth in the following modelling subsection. Likewise, the explanatory variable “Avg_Temperature” has to be taken into consideration, as it is highly significant.

The five simple linear models analyze NYC Citi Bike rides against one explanatory variable at a time. However, since all the explanatory variables considered are highly significant while each model has a low R-squared, it is reasonable to conclude that every single-variable model suffers from omitted variables: NYC Citi Bike rides depend not on a single explanatory variable but on several simultaneously. Therefore, the following section focuses on the creation of a multiple regression model.

 

 

Modelling

The goal of this section is to find the best-fitting regression line with lm(), whose coefficients minimize the sum of squared differences between the predicted and the observed values. The following multiple regression extends the previous models with multiple variables, aiming to capture the signals of the omitted variables that those models miss. The estimated multiple regression formula used to make inferences about the population of NYC Citi Bike users is as follows:

 

\[ \begin{aligned} \text{Log(Ride)} &= f(\text{Age}) + f(\text{User Gender} ) + f(\text{User Type}) + f(\text{Season}) + f(\text{Daily Time Slot}) + f(\text{Average Temperature}) + \epsilon \\ &= \beta_0 + \beta_1\text{Age} + \beta_2\text{User Gender} + \beta_3\text{User Type} + \beta_4\text{Season} + \beta_5\text{Daily Time Slot} + \beta_6\text{Average Temperature} + \epsilon \end{aligned} \]

 

The regression output below shows an adjusted R-squared equal to 0.53, meaning that the chosen explanatory variables explain 53% of the variation in the response variable. With a residual standard error of 1.106 on 1178490 degrees of freedom, the model has an intercept equal to 2.797. Most of the regression coefficients are negatively related to NYC Citi Bike rides, except for “User_TypeSubscriber”, “Daily_Time_SlotsLate Afternoon”, and “Avg_Temperature”. Individually, all the p-values are extremely close to zero, meaning that all the chosen explanatory variables are highly significant. In detail, a one-year increase in age is associated with a decrease of 4.087e-02 in log(Rides), controlling for the remaining variables. Likewise, being a “Subscriber” is associated with an increase of 2.161 in log(Rides), on average.

 

## 
## Call:
## lm(formula = log_rides ~ Age + User_Gender + User_Type + Season + 
##     Daily_Time_Slots + Avg_Temperature, data = nyc_total)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9295 -0.6390  0.1856  0.7724  4.3351 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     2.797e+00  7.957e-03  351.57   <2e-16 ***
## Age                            -4.087e-02  6.363e-05 -642.24   <2e-16 ***
## User_GenderFemale              -7.689e-01  2.045e-03 -375.94   <2e-16 ***
## User_TypeSubscriber             2.161e+00  2.212e-03  976.68   <2e-16 ***
## SeasonSpring                   -2.295e-01  2.949e-03  -77.84   <2e-16 ***
## SeasonSummer                   -2.000e-01  3.278e-03  -61.01   <2e-16 ***
## SeasonWinter                   -2.612e-01  3.903e-03  -66.93   <2e-16 ***
## Daily_Time_SlotsEarly Morning  -5.366e-01  3.540e-03 -151.59   <2e-16 ***
## Daily_Time_SlotsEvening        -8.955e-02  3.462e-03  -25.87   <2e-16 ***
## Daily_Time_SlotsLate Afternoon  1.815e-01  3.390e-03   53.55   <2e-16 ***
## Daily_Time_SlotsLate Evening   -9.500e-01  3.631e-03 -261.67   <2e-16 ***
## Daily_Time_SlotsLate Morning   -5.059e-02  3.411e-03  -14.83   <2e-16 ***
## Avg_Temperature                 1.823e-02  1.153e-04  158.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.106 on 1178490 degrees of freedom
## Multiple R-squared:  0.5294, Adjusted R-squared:  0.5294 
## F-statistic: 1.105e+05 on 12 and 1178490 DF,  p-value: < 2.2e-16

 

For the present multiple regression model, the following hypothesis test is carried out. The null hypothesis states that all the coefficients are equal to zero, while the alternative hypothesis states that at least one coefficient differs from zero.

 

\[ \begin{aligned} H_0 &: \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = \beta_6 = 0 \\ H_A &: \text{at least one } \beta_j \neq 0, \quad j = 1, \dots, 6 \end{aligned} \]

 

According to the F test and its p-value, which is below 2.2e-16, the null hypothesis should be rejected in favor of the alternative hypothesis, meaning that there is an evident relationship between the chosen explanatory variables and the outcome in the population. However, attention should be paid to the probability of rejecting the null hypothesis even though it is true. For this reason, running the test on a random sample of the population would be more insightful and might lead to a different test result.

 

Test and Training Data: Sampling

While sampling, it is important to be aware that p-values are heavily influenced by the amount of data taken into consideration. As the sample size increases, the null hypothesis is rejected more often than not. Since the multiple regression was run on more than one million data points, the same model is now tested on a sample of 100 observations. The main goal of this section is therefore to test whether the coefficients declared significant are still meaningful when the regression is run on only 100 observations.

The regression output follows:

## 
## Call:
## lm(formula = log_rides ~ Age + User_Gender + User_Type + Season + 
##     Daily_Time_Slots + Avg_Temperature, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3964 -0.7330  0.1769  0.8935  2.4843 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.032995   1.137779   2.666  0.00916 ** 
## Age                            -0.041097   0.007949  -5.170 1.47e-06 ***
## User_GenderFemale              -1.098256   0.261586  -4.198 6.47e-05 ***
## User_TypeSubscriber             1.872893   0.291510   6.425 6.79e-09 ***
## SeasonSpring                   -0.398288   0.358581  -1.111  0.26974    
## SeasonSummer                   -0.305915   0.411599  -0.743  0.45934    
## SeasonWinter                   -0.381018   0.494468  -0.771  0.44305    
## Daily_Time_SlotsEarly Morning   0.205047   0.530153   0.387  0.69987    
## Daily_Time_SlotsEvening         0.261733   0.476952   0.549  0.58457    
## Daily_Time_SlotsLate Afternoon  0.129302   0.440135   0.294  0.76963    
## Daily_Time_SlotsLate Evening   -0.873120   0.535856  -1.629  0.10685    
## Daily_Time_SlotsLate Morning    0.024730   0.478235   0.052  0.95888    
## Avg_Temperature                 0.019692   0.015703   1.254  0.21320    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.206 on 87 degrees of freedom
## Multiple R-squared:  0.506,  Adjusted R-squared:  0.4378 
## F-statistic: 7.426 on 12 and 87 DF,  p-value: 3.197e-09
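The `data = .` in the call above suggests that a random sample was piped into lm(). The sketch below illustrates the same idea in base R on simulated data: the same formula is fitted on the full data and on 100 randomly drawn rows, and the standard errors are compared. The data, the effect sizes and the object names are assumptions made for illustration.

```r
set.seed(123)
n <- 100000
full <- data.frame(
  Age = sample(18:80, n, replace = TRUE),
  Avg_Temperature = runif(n, 10, 90)
)

# Simulated response: strong age effect, weak temperature effect (made up)
full$log_rides <- 2.8 - 0.04 * full$Age +
  0.002 * full$Avg_Temperature + rnorm(n)

# Draw 100 rows at random and refit the same formula
small <- full[sample(nrow(full), 100), ]
fit_full  <- lm(log_rides ~ Age + Avg_Temperature, data = full)
fit_small <- lm(log_rides ~ Age + Avg_Temperature, data = small)

# Standard errors shrink roughly with the square root of the sample size
se_full  <- coef(summary(fit_full))[, "Std. Error"]
se_small <- coef(summary(fit_small))[, "Std. Error"]
round(se_small / se_full, 1)
```

Because the standard errors are much larger in the small sample, weak effects that are easily detected with a very large sample can fall short of significance with 100 observations, even when the coefficient estimates are similar.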

 

Based on this regression output, only the age, user gender, and user type coefficients can be considered significant, having a p-value between 0 and 0.01, which means that there is strong evidence to reject the null hypothesis at the 1% level. In other words, the null hypothesis is unlikely to be true for each of these variables considered individually. The insights from the regression coefficients can be summarized as follows:

Comparing the results obtained before and after the sampling, it is possible to conclude that “Season”, “Daily_Time_Slots” and “Avg_Temperature” were erroneously considered meaningful. Therefore, the null hypothesis related to each of these variables appears to have been wrongly rejected in the full-data model.

 

Illustrating the P-values: The User Gender Example

According to the discussion in the previous section, the “User_Gender == Female” coefficient (beta_2) is meaningful and different from zero, meaning that the null hypothesis has to be rejected. In this section, the test of the “User_Gender == Female” coefficient is taken as a case study, which allows us to translate the hypothesis analysis into a graphical and more intuitive representation.

 

\[ \begin{aligned} H_0 &: \beta_2= 0 \\ H_A &: \beta_2\neq 0 \end{aligned} \]

 

To plot the two-tailed test for “User_Gender == Female”, two values have to be calculated: the test statistic and the critical t-value. The calculations follow:

# Test Statistic = User_Gender Coefficient Estimate / User_Gender Standard Error 
-0.929519/0.267087
## [1] -3.480211
# T-value for User_Gender == Female coefficient
qt(0.01, 87)
## [1] -2.369977
# P-value for the two tailed hypothesis test of User_Gender == Female coefficient
2*pt(-abs(-3.480211), df = 87)
## [1] 0.0007854096

 

The coefficient test statistic is equal to -3.480211, while the t critical value, taken as the 1% lower-tail quantile, is equal to -2.369977. Moreover, the p-value for the two-tailed hypothesis test of the “User_Gender == Female” coefficient is approximately 0.0008. Below is the graphical representation of the test. With 87 degrees of freedom, the t critical values (red lines) border the rejection regions of the distribution. If the test statistic falls into one of those regions, the null hypothesis is rejected in favor of the alternative hypothesis. In detail, the t-statistic (blue line) falls into the rejection region, further confirming the previous insights. From a graphical point of view as well, we can clearly see that the null hypothesis should be rejected. Moreover, such a small p-value, lower than 1%, further indicates that “User_Gender” is significant.
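The same calculation can be wrapped in a small helper that returns the test statistic, the critical value, the p-value and the decision together. This is a sketch rather than the code used in the report: the function name `coef_t_test` is hypothetical, and, unlike qt(0.01, 87), which places 1% in a single tail, the helper splits the significance level across both tails, as is usual for a two-tailed test.

```r
# Two-sided t-test for a single regression coefficient
coef_t_test <- function(estimate, std_error, df, alpha = 0.01) {
  t_stat  <- estimate / std_error
  t_crit  <- qt(1 - alpha / 2, df)      # two-sided critical value
  p_value <- 2 * pt(-abs(t_stat), df)   # two-sided p-value
  list(t_stat = t_stat, t_crit = t_crit, p_value = p_value,
       reject = abs(t_stat) > t_crit)
}

# Estimate and standard error taken from the calculation above
res <- coef_t_test(-0.929519, 0.267087, df = 87)
res$reject
```

Splitting alpha across both tails gives a larger critical value in absolute terms than the one-tail quantile, so it is the more conservative of the two choices; the null hypothesis is rejected under both.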

 

 

 

Tidied Model

The main statistical findings of the “lm_citibike_final” model are summarized below.

 

 

 

Augmented Model

The augmented summary of the “lm_citibike_final” model follows. The function augment() creates a tibble to which, for example, the predicted values and the residuals are added.

 

 

Residual Analysis

For inferential regression, some conditions have to be met, such as the linearity of the relationship between variables or the normality of the residuals. This last condition deserves to be discussed in more detail. To check the normality of the residuals, a histogram has been created. By analyzing the chart plotted below, it is possible to see that there are more positive residuals than negative ones. Therefore, the residual distribution is slightly left-skewed. However, the identified skewness is not drastic, and the residual distribution looks very close to the normal one. For this reason, it is possible to state that the normality condition of the residuals has not been violated. As a consequence, this might suggest a correct model specification.
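The residual check described above can be sketched in base R as follows. The model is a stand-in fitted on the built-in mtcars dataset, not “lm_citibike_final”, and the object names are hypothetical. Besides the histogram, the sketch counts positive and negative residuals and computes the sample skewness as an assumed numerical complement to the visual inspection.

```r
# Stand-in model on a built-in dataset (NOT the Citi Bike data)
fit <- lm(mpg ~ wt + factor(am), data = mtcars)
res <- resid(fit)

# Histogram of the residuals for a visual normality check
hist(res, breaks = 10, main = "Histogram of residuals", xlab = "Residual")

# Counts of positive and negative residuals
n_pos <- sum(res > 0)
n_neg <- sum(res < 0)

# Sample skewness (third standardized moment); negative values indicate left skew
skew <- mean((res - mean(res))^3) / mean((res - mean(res))^2)^1.5

# Normal Q-Q plot as a second visual check
qqnorm(res)
qqline(res)
```

A skewness close to zero and Q-Q points lying near the reference line would support the normality condition. Large departures in either would call for a closer look.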

 

 

Last Step: Calculating the Average Error

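The calculation below reveals an average error (mean residual) equal to 0. Since the residuals of an OLS model with an intercept average to zero by construction, this result is consistent with a correctly fitted model but does not by itself confirm a correct model specification.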
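When reading an average error of zero, it helps to keep in mind that OLS with an intercept forces the residuals to average to zero regardless of how well the model is specified. The sketch below illustrates this on simulated data, which is an assumption made purely for demonstration and is unrelated to the Citi Bike dataset: a straight line is fitted to a quadratic relationship, and the mean residual is still numerically zero.

```r
set.seed(42)

# Simulated data with a quadratic relationship (illustrative only)
x <- seq(-3, 3, length.out = 200)
y <- x^2 + rnorm(200, sd = 0.5)

# Deliberately misspecified model: a straight line fitted to curved data
misspecified <- lm(y ~ x)

# Average error: zero up to floating-point precision
mean_error <- mean(resid(misspecified))
print(mean_error)

# The residuals still carry the structure the model missed
pattern <- cor(resid(misspecified), x^2)
print(pattern)
```

For this reason the average error is best read together with the residual plots, which can reveal patterns that the mean alone hides.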

 

 

 

Findings and Conclusion

The interest in investigating consumers’ behavior towards the use of “Citi Bike” in New York City (NYC) from 2018 to 2020 was triggered by New Yorkers’ growing demand for greener ways of transportation. This was caused both by consumers’ shift towards eco-friendly habits and by the coronavirus pandemic.

The analysis investigated how the user age, the user gender, the user type, the season, the daily time slots and the average temperature affect the NYC Citi Bike rides. With the use of OLS modelling, it was found that the “statistically significant” explanatory variables are Age, User Gender and User Type.

Firstly, there is a negative relationship between age and rides count: the older the users, the less inclined they are to cycle. Secondly, it was discovered that while males lead the user segment, female users are increasing faster than males. Schmitt (2019) confirms that understanding the age profiles and the gap between male and female cyclists is important to drive growth and improve the bicycle network as a whole. The gap in NYC closely reflects the national trend of one female for every three male cyclists (NYC, 2019). The same report confirms that the overall cycling population is growing, and Citi Bike data reveals that growth among female cyclists is outpacing growth among male cyclists (NYC, 2019).

Last but not least, an important underlying discovery is that Subscribers, rather than Occasional customers, lead Citi Bike ridership. However, neither the daily time slot nor the season was found to be statistically significant, so they do not explain the variation in rides.

To conclude, the authors have identified that the profile of the top Citi Bike customer is a 44-year-old male subscriber. Customers prefer to ride during Summer and in the Late Afternoon, even though these last two variables were not found to be statistically significant.

 

 

Limitations and Recommendations for the Future Research

This analysis and its findings are limited to this dataset. Although the chosen explanatory variables were able to explain 53% of the variation, the authors recommend that future researchers expand this research. Firstly, it is suggested to use more datasets in order to find more variables to test in the multiple regression model. This would produce more in-depth results. Secondly, it could also be interesting to evaluate the use of bikes in NYC as a whole, including not only those provided by Citi Bike but also those of other providers and personal bikes.

Finally, the authors cannot claim that the model created is perfect since, for example, it uses imperfect data (Birth Dates) and omits some variables that could be added in future research. However, the authors believe it does a reasonably good job of quantifying, in particular, the impact of the type of Citi Bike rider.

 

 

References